scRNA seq attempts to identify the expression level of all the genes in each cell from an heterogeneous population. The starting point for to analyze snRNA-seq experiments in this application will be tables, or read count matrices, containing the number of copies detected of each type of mRNA in each individual cell.
To ilustrate each of the analysis steps we will use a 10X Genetics experiment on aproximately 10.000 peripheral blood mononuclear cells
First of all you need to load the data you want to work with to the application. Click the "Browse" button in the "Upload data" section and select the files you want to upload, if you want to upload folders you can compress them in a zip file and when they are uploaded they will be automatically extracted.
This app can read single cell RNA-seq data in several formats:
Click on the buttons “Select File” or “Select Folder” to choose an input dataset, and the click on “Load Data”.
There’s also the option to load additional metadata -for instance, from a previous analysis- selecting an excel file or a plaint text table (‘.csv’, ‘.tsv’, ‘.txt’) with extra cell information. The only requisite is that the cell IDs in the dataset and de selected table match. A notification will pop up when not all the cells of the dataset are included in the meta data table.
Finally, since some analysis steps, like cell type imputation, require the results of differenteial gene expression and this can be a time consuming process, there is also the option to load previously calculated DE results from an RDS or excel file.
After loading the dataset, a brief summary will appear at the bottom:
In the rest of sections of the application, the ‘Help & Info’ box at the top right will display the project name and the number of cells in the dataset:
Merging or integrating datasets is a necessary step to compare and contrast cell groups minimizing possible batch effect or technical biases. In this app there are two types of data merging implemented:
‘Simple merge’ will combine the raw count matrices of the experiments, without performing any correction.
‘Sample integration’ will try to correct for technical differences (such as batch effects) by identifying sets of cells (anrchors) that are in a similar state across datasets.
Normalization is a pre-processing step for expression data that tries to remove technical variability while preserving biological differences, so the variance of a gene reflects biological heterogeneity. There are two normalization methods implemented:
Do a quick quality assesment of the cells in your dataset and determine thresholds to filter low quality cells. In this section you can check the distribution of read counts, number of genes per cell, the percentage of reads corresponding to mitocondrial or ribosomal RNA. In addition to that, theres also the possibility to estimate the influence of the cell cycle on the cells.
One of the easiest ways to assess the quality of the cells in the dataset is to take a look at the number of read counts (“nCount_RNA”) and genes (“nFeature_RNA”) detected in each one.
In addition to that, comparing gene count with the percentage of mitocondrial or ribosomal RNA (“percent.mt” and “percent.ribo”) is another simple way to pinpoint low quality cells. In most cases, high percentages of mtRNA will be caused by dying or degraded cells, issues in sample preparation, etc.
Plots of basic quality measurments, number of reads and genes per cell, and percentages of mitocondrial and ribosomal reads
As is shown in the image above, a minimum threshold of 1000 detected genes per cell will filter out most of those low-quality cells. On the other hand, the percentange of ribosomal RNA separates the cells in two groups. In situations like this, it can be interesting to check if this has any relation to cell type in the sample.
Additionally, a test with a few basic default filters will be applied, but its important to take into account that this is just an example and the particular values used could be unfit for the dataset:
Plots of basic quality measurements after automatic filtering
Cell cycle driven gene expression can have an important impact in single-cell RNA-seq experiments.
Down below are the cell cycle scoring results of a bone marrow dataset where cell cycle phase has an significant influence on the gene expression profiles of the cells (source data from Nestorowa et al., Blood 2016). In cases like this, the cells in the dataset will separate clearly by their estimated cell cycle phase.
In a case like this, cell cycle related genes will have an effect on the normalized data used for dimensional reduction and cell clustering, and subsequently on all steps of the analysis than rely on them (differential expression, determining cell types, etc). Using the cell cycle scoring function we can estimate the cycle phase of every cell based on the expression of genes liked to G2M or S phase.
In some situations there is an interest in keeping the variability due to cell cycle heterogeneity: for instance, in a developmental context, where changes in cell proliferation are related to different stages of the cells from stemness to terminal differentiation. More often than not, though, there is more interest in minimizing the influence of cell cycle phase, so using the scoring functionality will allow to regress the impact of call cycle related gene expression.
It is a good idea to come back to this section after clustering or sample merging to check if our cell groups are biased by read counts, number of detected genes, etc. Differences in mtRNA or ribosomal RNA don’t necessarily have to be linked to random noise technical bias.
In contrast to the bone marrow dataset, here are the results of cell cycle scoring for the peripheral blood mononuclear cell dataset we’ve been using as an example:
As we can see, in this case there isn’t a clear separation of the cells assigned to each phase, we would not expect a strong influence of cell cycle phase on the following steps of the analysis.
In this section we can discard cells from the dataset or make subsets.
The first set of filters allows do define limits to read and gene counts, percentage of mtRNA and ribosomal RNA. Eahc time there’s a change in the dataset the values in the boxes will update, by default displaying the maximum and minimum value for each variable.
The second set of filters is for any other cell metadata that the dataset might have associated -for instance, if this is a 2nd analysis of the experiment, or if other analysis have been done-. Numerical variables will take a minimum-maximum range and categorial variables, a selection box will be deployed to select the values of interest.
This filters can be used to select cells to keep or to remove cells from the dataset. Some examples:
Any other additional metadata imported in the ‘Load data’ tab can be used as a filter -for instance, mutation or sequence variability annotations-. The last section of this tab is dedicated to variable regression and count normalization. Variable regression is done to try to compensate for the effects of a known variable in the read counts, for example: cell cycle phase, abundance of mitocondrial and ribosomal RNA, or Y chromosome expression -to avoid differential expression results being biased due to a mix of XX and XY samples-.
Normalization is a pre-processing step for expression data that tries to remove technical variability while preserving biological differences, so the variance of a gene reflects biological heterogeneity. There are two normalization methods implemented:
This step in the analysis will most of the times be the foundation for results of a scRNA-seq analysis, and so, small differences can impact greatly the conclusions gathered from the data -number of cell types in the samples, differential expression between treatments or conditions, etc-.
Grouping elements in increasing numbers of cluster. Clustering
criteria will influence our conclusions about the data: there is a need
to keep a balance between a cluster’s robustness and reproducibility and
its capacity to be informative.
Even though it might sound
counter-intuitive, very granular classifications like D
can be detrimental if we don’t have enought information on the
biological context to interpret the results and assess their relevance.
On the other hand, situations like A can be
appropiate for exploratory analysis, but when it’s possible we should
aim to uncover knowledge that is only reachable through techniques like
scRNA-seq.
In this section we can explore the influence that several parameters can have in grouping cells together. Is important to take into account that even though running a clustering algorithm will separate the cells in categorical “boxes” -each of the clusters-, the underlying data is not categorical, but quantitative, and more often than not a set of clusters can be the result of dividing a continuous spectrum of cell states.
The most straightforward example of this effects are differentiation processes, in which we can have a continuous spectrum ranging from progenitors to terminally differentiated cells. In cases like this the underlying data might not have clear barriers between each of the differentiation stages, so different clustering parameters will set different borders between clusters. The plots below represent different clustering results of an scRNA-seq of 4-day embryoid bodies, with pluripotent cells on the left side and hemangiogenic progenitors on the right:
This app implements the evaluation of two main parameters: number of principal components and resolution.
¿What are principal components? When we process large datasets made of many correlated variables (in our case, the counts of each gene), our first step will try to make the analysis less computationally demanding by “summarizing” our data in a more manageable number of variables.
This is what’s called dimensionality reduction, and the most common algorithm of this type is Principal Component Analysis or PCA, which consists on calculating linear combinations of the expression of several genes so that the variability of our dataset of hunderds or thousands of cells can be represented in a few dozens of variables, called principal components.
By default the PCA algorithm will generate 50 principal components, but depending on the complexity of our particular sample we will usually need less than that to characterize our dataset.
To estimate an optimal number of principal components for our dataset we can follow several criteria:
Elbow plot: Decrease in standard deviation of principal components
This can be done through visual inspection of an elbow plot representing the standard deviation of each component (example shown above) or establising a limit in the decrase rate. In orther to find this “elbow” or optimal PC number, our app will calculate at which point the fall in standard deviation is less than 5% for three components in a row. In the example plot above, the optimal number of PCs returned would be 12.
Another way to estimate the optimal PC number is to observe the changes in clustering results depending on the different values of the parameter. Relying too much on 2D representations can lead to taking a decision based on biases in visual perception or preconceptions of our data -e.g.: if I expect to have 3 cell types in my sample, I might favor parameters that produce 3 groups of cells-. To help evaluate the changes in clutering we have implemented two measurements:
Snapshot of the interactive cluster flow plot for 9-19 PCs
And here would be the corresponding 2D representations, with red
arrows pointing to the two re-organized clusters:
This application uses the SNN modularity optimization based clustering algorithm implemented in the package Seurat, in which a relevant parameter is ‘resolution’. Clustering resolution will determine the granularity of the resulting cell groups, with low resolutions returning large and general clusters, and higher resolutions producing the opposite. There isn’t an optimal, objective resolution. This will depend on the dataset’s size and diversity.
In this section the resulting plots are similar to those evaluating the number of principal components, depicting co-clustering conservation and clustering flow:
Testing different resolutions from 0 to 1, we get the following co-clustering conservation plot:
Where we can see that there are major reorganizations at several points. This re-organizacions are much more clear in the cluster flow plot:
As is shown in the
plot, with resolutions larger than 0.5 we start to see that “unstable
re-organization” of several of the clusters as we ask for more granular
divisions. This is what usually happens when the differences between
cells are less categorical and more continuous, so the border between
clusters end up varying a lot.
The optimization phase of this clustering algorithm has a stochastic component, which means that, from the same data, we could get slighly different results each time the cells are clustered. The specific result we get each time is determined by the ‘seed’ used, usually a random number. Using a set of seeds and comparing the results is possible to examine how much variation there is among clustering results, which tells us how ambiguous or uncertain the classification of each cell is. This uncertainty score is designed to make easier to identify:
Oversplitting: taken to the extremes, clustering results can range from an unique cluster containing all the cells, to each cell having its own cluster. When the resolution selected is too high for the dataset, the clustering algorithm will start to find spurious boundaries that at the end of the day are not representative of meaninful differences between cell groups. A sign that this is happening are uncertainty scores being generally high across all cells.
Undersplitting: on the other hand, if the resolution chosen if too low, the cell clusters defined will be too general to be informative down the line, but cluster asignment will be quite robust. An example of this in our PBMC dataset would be choosing a resolution so low that we ended up with two groups: monocyte-like and lymphocyte-like cells.
Uncomplete datasets or multiplets: when an isolated cell is assigned a particularly high uncertainty score, it’s usually due to that cell having an expression profile in between two clusters. If we’re working with a small dataset, our issue could be that we’re missing uncommon intermediate states, and thus that cell is tenuously linked to related cell groups. On the other hand, this can also happen when a cell’s read counts are actually the result of the aggregation of two or more cells into a single droplet during cell capture, and thus the erratic cluster assignment responds to the mix of expression profiles of two cell types.
Continuous processes: if a particular section of the dataset presents a concetration of high uncertainty scores, we migh be dividing a cells representing a dynamic spectrum of changes in gene expression. At the start of this section we mentioned the case of differentiating cells. Another example are cells in continuous tissues, like endothelial cells forming arteries, capilaries, and veins, which commonly form adjancent groups if all three are present in the sample.
Calculating clustering uncertainty
In the image shown below we display the distribution of uncertainty scores for each cell at each resolution value tested as violin plots, while the black dots represent the mean score for each cluster. Since this is a large dataset, uncertainty scores were calculated using 25 iterations only. By looking for local minimums -a resolution of 0.3 in this example-, these uncertainty scores can help us balance the trade offs between a classification limited enough to be robust, and detailed enough to be biologically informative. Uncertainty scores and cluster asignments at each resolution can be exported to be examined in more detail.
In this tab, generate a definitive clustering for the cells in the dataset. The clustering results will be stored as part of the cell meta data, which can be used in any of the other sections of the app (filtering, differential expression, cell type imputation, general plotting, etc. )
Clustering parameters can be chosen manually or determined automatically based on the dataset size, although using the tools to evaluate clustering parameters is recommended. If the parameter selection is set to fully automatic, the clustering algorithm will be run for three resolution values (low, medium, and high, with the particular values depending on dataset size)
As is explained in the last chapter, the optimization phase of this clustering algorithm has a stochastic component that we can take advantage of to measure the similarity of clustering results across several iterations given a set of parameters, named here as clustering uncertainty. If this option is selected, uncertainty scores will also be stored as cell meta data.
In our PBMC example, we will proceed setting the parameters indicated in the previous section: 15 principal components and a clustering resolution of 0.3. In addition to store the results internally, the app will also display the distribution of cluster sizes, as well as uncertainty scores if those are calculated:
To avoid the possible biases due to evaluating clustering results by the 2D visualizations the option to plot the results its deactivated by default, but this can be changed click on the “Generate 2D reduction plots” checkbox. Here are the clustering results for the PBMC dataset plotted over its UMAP reduction:
As we can see here,
the most “uncertainly clustered” cells are in two groups, one situated
in a transitional area, the border between cluster 0 and 1, and on the
other hand, the cells in cluster 12, which is ocasionally merged with
cluster 5. Additionally the limits of cluster 2 also contain a few
uncertainly clustered cells
UMAP and tSNE plots are the one of the most recongnizable images linked to single cell analysis. UMAP and tSNE are two non-linear dimensional reduction techniques widely use in machine learning for their capacity to generate easy to interpret 2D maps from very large datasets, which makes them extremely attractive to display highly dimensional scRNA-seq information. As it happens with clustering, dimensional reduction techniques like UMAP and tSNE have an stochastic component as well, so a particular visualization is one of the many possibile ways to position the cells.
In this tab we have two sections:
Selecting Explore options will generate a collection of several possible reductions for the dataset using different random seeds or testing a range of values for the algoritm’s parameters.
The option Generate an embedding for the dataset will let the user adjust the value of the main parameters for UMAP or tSNE reduction algorithms and store the cell coordinates in the dataset -although this coordinates can also be exported to a CSV file independently-.
tSNE and UMAP are the most widely used dimensionality reduction algorithms, and each algorithm produces slightly different results depending on each dataset’s particular characteristics: how many cells types we have, how large is difference between those cell types, how are the cell types related to eachother, etc. None of the analysis tools implemented in this app will use the UMAP or tSNE coordinates as input, so in this context the choice is mainly relevant as a convinient way to display analysis results. Other programs, like some used for trajectory analysis, might do use tSNE or UMAP reductions as input.
Here are two brief guides on how to interpret tSNE and UMAP reductions and what are the strengths and shortcomings of each particular algorithm.
For UMAP or tSNE, define the most influential hyperparameters in each algorithm:
Perplexity (tSNE) and Number of neighbors (UMAP): This two variables effectively control how many neigboring cells are taken into account when constructing the embedding. On one hand, large numbers will prioritize global relationships between cell groups but might lose fine detail. On the other hand, small values of ‘perplexity’ or ‘number of neighbors’ will put more enphasis on the local structure.
Minimum distance (UMAP): this parameter controls how tight the cells will ‘clump’ together in the final plot. Larger values will make the points more disperse and viceversa.
Selecting Default values will set the parameter values and random seed to the default values used in the package Seurat: * Perplexity and number of neighbors = 30 * Minimum distance = 0.3 * Seed for tSNE = 1 * Seed for UMAP = 42
Selecting Set based on dataset size for the general parameter definition will define a range of values for each parameter.
Here is an example with a set of parameter values used to generate an UMAP reduction:
And an example of 6 different random seeds using default parameter values (Number of neighbors =30, min. distance=0.3):
In this tab we can estimate which genes have significantlly different expression patterns between groups of cells. Cells in the dataset can be grouped by their value in one or two variables, for instance, ‘sample’ and ‘condition’, and most commonly, their assigned cluster.
In this app there are three modes of analysis implemented. The difference between them is how are the foreground and background groups defined:
‘1 VS rest’: the gene expression of each group of cells will be compared to the rest of the dataset.
‘1 VS 1’: groups will be compared pair by pair. This can take a long time and most of the 1 VS 1 comparisons can be irrelevant, so in order to speed the process up it’s possible to choose two particular groups of cells to compare. This mode of analysis can be very useful to explore the difference between similar or adjacent clusters.
When groups are formed using two variables, let’s say ‘treatment’ and ‘cluster’, there is also the option to do a conditional comparison, E.g. inside each cluster, compare treatment A and treatment B.
Possible comparisons depending on the mode of analysis
Differential expression results will often be the basis for posterior analysis, such as like cell type imputation in this context, or any other applied to differentially expressed genes generally (functional annotation, transcription factor interaction, etc.)
To both speed up computation and minimize the proportion of low-relevance genes in the results, users can sete several parameters:
Minimum % of cells expressing a gene: ensure that only genes consistently expressed in either the foreground or the background are tested. Differential expression test wil be done for every gene meeting the minimum percentage of cells expresing said gene in either the foreground or background groups.
Minimum ±Log2( Fold Change ): ignore small changes in expression. Set this value to 0 to report differential expression for all genes, but take into account that, since we will be testing large cohorts of individual cells, minor changes in expression might be statistically significant, but not have the magnitude for the difference to be biologically relevant.
Minimum number of cells: skip comparisons where the foreground or background don’t have at least a minimum number of cells.
Remove mitocondrial and ribosomal genes: when any of this options is selected, ribosomal or mitocondrial genes will be excluded from the differential expression tests.
The app will display a summary with the parameters and the number of genes DE:
In our PBMC example we get the following results:
| Comparison | Cells.in.Foreground | Cells.in.Background | Total.DEG | Up.regulated.Genes | Down.regulated.Genes |
|---|---|---|---|---|---|
| 0 VS rest | 2025 | 18733 | 2197 | 1589 | 608 |
| 1 VS rest | 1711 | 19047 | 2145 | 1738 | 407 |
| 2 VS rest | 1448 | 19310 | 1892 | 321 | 1571 |
| 3 VS rest | 1288 | 19470 | 1327 | 353 | 974 |
| 4 VS rest | 1287 | 19471 | 2034 | 354 | 1680 |
| 5 VS rest | 745 | 20013 | 1194 | 461 | 733 |
| 6 VS rest | 724 | 20034 | 1556 | 384 | 1172 |
| 7 VS rest | 588 | 20170 | 1607 | 1349 | 258 |
| 8 VS rest | 473 | 20285 | 1426 | 702 | 724 |
| 9 VS rest | 404 | 20354 | 564 | 550 | 14 |
| 10 VS rest | 394 | 20364 | 1368 | 350 | 1018 |
| 11 VS rest | 189 | 20569 | 1376 | 1273 | 103 |
| 12 VS rest | 137 | 20621 | 683 | 244 | 439 |
| 13 VS rest | 101 | 20657 | 2055 | 1794 | 261 |
Below the summary, the top differentily expressed genes are displayed. Select the one of the comparisons available and the results of the 20 genes with the smallest p-value will be shown.
For each comparison the program will generate a table to store the results. For each gene the following variables are stored:
Lastly, a heatmap will be generated using the top 5 DE genes in each cluster displaying the log-normalized expression in all the dataset. Below is an example using a subset of the PBMC dataset with just three clusters:
This section of the app is dedicated to generate different types of plots from the dataset:
Create plots displaying log-normalized read counts for up to 4 genes. If there is a 2D reduction stored in the dataset, gene expression can be projected onto the tSNE or UMAP reductions, otherwise violin plots will be generated. To generate and store a 2D reduction of the dataset, use the tools in the 2D visualizations tab
Example of 2D projection (A) and violin plot (B) for the gene LCK in the PBMC dataset
If violin plots are selected, there also the posibility to split the violing by one or two metadata variables (E.g: ‘cluster’, ‘sample’, etc):
When projecting the meta data into 2D reductions, if any variable is stored as numerical, a prompt will appear asking wheter to treat this variables as categorical (an example of this could be a ‘Sample’ variable with values ‘1’, ‘2’, ‘3’… meant to be interpreted as cualitative, not cuantitative)
When generation violin or barplots, if the metadata variables selected
are categorical (E.g. ‘cluster’), a barplot is created, displaying the
percentage of cells in the dataset with each value. If the variables are
numerical, violin plot are generated. As with gene expression plots, for
numerical variables it’s possible to select additional grouping
variables to split the violin plots.
Estimate cell type based on differentially expressed genes between groups of cells
There are two modes of cell type imputation available:
Automatic annotation: using the R package scCATCH. Choose a grouping variable (i.e.:‘cluster’) and scCATCH will calculate differential expression and compare the results to their database of markers for healthy and cancer tissues. scCATCH is available for human and murine cells.
Manual Annotation: if differential expression has been previously calculated for the dataset, this section will allow the user to define markers of their choice for their cell types of interest.
Start by selecting a grouping variable, like ‘cluster’, and the tissue type closer to your source material among the ones available in scCATCH’s internatl database.
NOTE: scCATCH’s method to calculate differential expression is not the same as the one used internally in this app, so the results might differ slightly.
A notification will pop up in the bottom right if less than 10 tissue markers are present in your dataset. For more information on how scCATCH works internally, check their publication or their GitHub repository
Here is an example of the automatic annotation done in the PBMC dataset:
Besides 2D plots, the results are avaibalbe to be exported independently in table format:
The manual annotation system allows the user to introduce custom sets of genes linked to biological traits and use them to annotate the cell groups in their scRNA-seq dataset based on the differential expression of said genes. Even though this has ben implemented with cell type imputation in mind, this approach could be used with other cell properties. The recommended mode of differential expression analysis for cell type imputation is ‘1 VS rest’.
To introduce custom markers, type a cell type name and the genes that characerize it and click the ‘Add’ button. Marker sets don’t have to be exclusive, a gene can be linke to more than one cell type. As markers for cell types are added, a table display below the text boxes will be updated. If any markers added are not present in the dataset, a notification will pop up in the bottom right corner:
Cell groups will be assigned a cell type based on the differential expression of the markers selected by the user, returning the cell type whose markers are significantly up-regulated or down-regulated and have a higer mean log2(Fold Change). If none of the introduced markers are significantly up-regulated in a cluster, or the mean log2(FC) of DE genes is below the minimum log2(FC) threshold set when calculating differential expression, it will be marked as “Undetermined”.
Some considerations:
Unlike experimental techniques such as immunohistochemical staining, in this approach cell types are not assigned based on absolute expression of a marker, but on relative over-expression compared to the rest of cells in the dataset. This is based on the assumption that cell type will biggest influence on gene expression overall, so much so that the different cell clusters generated will be primarily linked to this trait. Take this into consideration when using markers linked to characteristics other than cell type, as well as when using a grouping variable in the dataset different from assigned clusters.
The metric used to compare the different sets of gene makers is a mean of log2(FC) values, so even though is possible to use only one gene to characterize a cell type, it is recommended to use at least two or three. In case only one gene is set as a cell type marker, please review the results.
Using classical markers for peripheral blood with the PBMC dataset:
| Cell.Type | Markers |
|---|---|
| Naive CD4+ T | IL7R, CCR7 |
| CD14+ Monocyte | CD14, LYZ |
| Memory CD4+ | IL7R, S100A4 |
| B cell | MS4A1 |
| CD8+ T | CD8A |
| FCGR3A+ Monocyte | FCGR3A, MS4A7 |
| Natural killer | GNLY, NKG7 |
| Dendritic cell | FCER1A, CST3 |
| Platelet | PPBP |
The results in this section will consist of three elements: annotation plots, cell imputation table, and marker information table.
One plot is generated for each cell group, with a 2D reduction plot on the left side -if available- and a barplot displaying the differential expression values of each group of marker on the right. The bars in the plot are arrange to form a circle, which the minimum value in the center and the maximum in the outward perimeter, with the numeric scale at the center top of the circle. Genes whose differential expression is significant will be colored in saturated and bright colors. If differential expression is not significant, the log2(FC) will be colored in muted, unsaturated hues. Non-expressed genes are assigned a value of 0.
Example of three the resulting plots for manual annotation, in this case for clusters 0 (A), 4 (B), and 13 (C)
This table gathers information on the assignment of a cell type to each cluster, cointaining the following:
| Group | Size | DE analysis mode | Num of DEGs | Num of up genes | Num of down genes | Markers UpReg | Markers DownReg | Putative cell type | |
|---|---|---|---|---|---|---|---|---|---|
| 0 VS rest | 0 | 2025 | 1 VS rest | 2197 | 1589 | 608 | CD14, LYZ, S100A4, CST3 | IL7R, CCR7, IL7R | CD14+ Monocyte |
| 1 VS rest | 1 | 1711 | 1 VS rest | 2145 | 1738 | 407 | CD14, LYZ, S100A4, MS4A7, CST3 | IL7R, CCR7, IL7R, FCGR3A | CD14+ Monocyte |
| 2 VS rest | 2 | 1448 | 1 VS rest | 1892 | 321 | 1571 | IL7R, CCR7, IL7R | CD14, LYZ, S100A4, MS4A7, NKG7, CST3 | Naive CD4+ T |
| 3 VS rest | 3 | 1288 | 1 VS rest | 1327 | 353 | 974 | IL7R, IL7R | CD14, LYZ, MS4A7, CST3 | Memory CD4+/Naive CD4+ T |
| 4 VS rest | 4 | 1287 | 1 VS rest | 2034 | 354 | 1680 | IL7R, CCR7, IL7R, CD8A | CD14, LYZ, S100A4, MS4A7, CST3 | CD8+ T |
| 5 VS rest | 5 | 745 | 1 VS rest | 1194 | 461 | 733 | CD8A, FCGR3A, GNLY, NKG7 | CCR7, CD14, LYZ, MS4A7, CST3 | CD8+ T |
| 6 VS rest | 6 | 724 | 1 VS rest | 1556 | 384 | 1172 | MS4A1 | IL7R, CD14, LYZ, IL7R, S100A4, MS4A7, CST3 | B cell |
| 7 VS rest | 7 | 588 | 1 VS rest | 1607 | 1349 | 258 | S100A4, FCGR3A, MS4A7, CST3 | IL7R, CCR7, CD14, LYZ, IL7R | FCGR3A+ Monocyte |
| 8 VS rest | 8 | 473 | 1 VS rest | 1426 | 702 | 724 | FCGR3A, GNLY, NKG7 | IL7R, CCR7, CD14, LYZ, IL7R, MS4A7, CST3 | Natural killer |
| 9 VS rest | 9 | 404 | 1 VS rest | 564 | 550 | 14 | CD14, LYZ | GNLY, NKG7 | CD14+ Monocyte |
| 10 VS rest | 10 | 394 | 1 VS rest | 1368 | 350 | 1018 | MS4A1 | IL7R, CD14, LYZ, IL7R, S100A4, MS4A7, CST3 | B cell |
| 11 VS rest | 11 | 189 | 1 VS rest | 1376 | 1273 | 103 | LYZ, S100A4, FCER1A, CST3 | IL7R, CCR7, IL7R | Dendritic cell |
| 12 VS rest | 12 | 137 | 1 VS rest | 683 | 244 | 439 | IL7R, IL7R, CD8A, GNLY, NKG7 | CCR7, CD14, LYZ, MS4A7, CST3 | Natural killer |
| 13 VS rest | 13 | 101 | 1 VS rest | 2055 | 1794 | 261 | FCER1A, CST3 | IL7R, CD14, LYZ, IL7R, S100A4, MS4A7 | Dendritic cell |
This table has a brief summary on the cell type markers introduced and their expression in the dataset, which allows to detect if any gene might not be expressed in enough quantity to function as a cell type marker. The table includes the following fields:
| Gene | Cell type | Pct cells expressing | mean CPM in expressing cells |
|---|---|---|---|
| IL7R | Naive CD4+ T | 41.51 | 1171.18 |
| CCR7 | Naive CD4+ T | 31.27 | 373.42 |
| CD14 | CD14+ Monocyte | 37.43 | 583.55 |
| LYZ | CD14+ Monocyte | 59.69 | 8554.46 |
| IL7R | Memory CD4+ | 41.51 | 1171.18 |
| S100A4 | Memory CD4+ | 85.21 | 2551.64 |
| MS4A1 | B cell | 11.40 | 1162.62 |
| CD8A | CD8+ T | 17.00 | 361.51 |
| FCGR3A | FCGR3A+ Monocyte | 20.79 | 721.48 |
| MS4A7 | FCGR3A+ Monocyte | 29.79 | 420.53 |
| GNLY | Natural killer | 15.19 | 5699.93 |
| NKG7 | Natural killer | 22.61 | 2517.98 |
| FCER1A | Dendritic cell | 1.93 | 403.96 |
| CST3 | Dendritic cell | 49.20 | 2068.29 |
| PPBP | Platelet | 1.09 | 3134.86 |
The aim of this section is to find which metabolic pathways might affected by the differential expression observed. To that end, we will examine if there is an overrepresentation of differentially expressed genes among elements linke to a particular pathway.
The app has stored a collection of gene sets from several pathway databases (Biocarta, Reactome, etc…) representing a wide array of molecular pathways. Here we will use the results from the differential expression analysis section to calculate if there is an over-representation of DEGs in each pathway.
The process goes as follows: * Get a list of the genes expressed in the cells tested. For each DE analysis, the first step will be to get a list of the genes that could be considered as “expressed” among the cells in which DE was tested. For this, we’ll take the same threshold as in differential expression analysis: by deafult 25% of the cells analyzed must have read counts > 0 for a gene to be considered “expressed”. * Generate contingency tables. The next step is to count how many of those expressed genes are differentially expressed, and for each pathway gene set, how many genes are part of the pathway:
| Are the genes part of this pathway’s gene set? | |||
|---|---|---|---|
| Yes | No | ||
| Are the genes | Yes | a | b |
| differentially expressed? | No | c | d |
A contingency table will be made for each pathway gene set. From this
contingency table we can calculate an Odds
Ratio as (a/b) / (c/d), which quantifies the level
of over-representation that differentially expressed genes have in that
gene set, and a p-value using Fisher’s exact test. P-values will
afterwards be correcter for multiple testing by FDR. * Odds Ratios >
1 indicate enrichment of DEGs in that pathway * Odds Ratios < 1
indicate depletion of DEGs in that pathway
After running the analysis, a summary will be displayed in the screen, and the detailed results will be available in excel format (.xlsx). A table of result will be generated for each differential expression analysis selected. The tables in the full results contain:
There’s an example below of the Likert type plots generated to ilustrate DEG enrichment in metabolic pathways. In this case we used the data of endothelial cells from an experiment on lung samples from mice exposed or not to intermitent hypoxia, so after processing the dataset our cells are divided by two variables: treatment and cluster. The four plots below represent:
As a general rule, differential expression between different clusters will result in larger numbers of differentially expressed genes than if we test cells of the same cluster separated by another criteria (treatment, sample, etc), so it’s expected for conditional differential expression analysis to also yield less significant enrichment in pathway analysis as well.
This application implements tools for single cell RNA-seq analysis published in the following R packages:
Experiments used in this guide:
In addition, this application makes use of the following resources:
MSigDB v7.5.1 collection of genesets characterizing metabolic pathways annotated in public databases. such as: